NO-ISSUE: Restrict test retries to an allowlist of test names by mkowalski · Pull Request #31222 · openshift/origin

mkowalski · 2026-05-27T12:48:47Z

Summary

Replace the image-tag-based and hardcoded retry eligibility in shouldRetryTest() with a single YAML allowlist (retry_allowed_tests.yaml) that is the sole source of truth for which tests may retry on failure.
Previously, any test from a permitted image tag could retry, and 6 specific tests had hardcoded exceptions. Now, a test must be explicitly listed in the YAML to get a retry.
The allowlist is populated from BigQuery retry_statistics data, scoped to release 4.22, last 30 days, with a threshold of >= 5 flake occurrences. This yields 206 tests.

Behavior

Scenario	Retries?
Test listed in the YAML allowlist	Yes
Test NOT in the allowlist	No
Empty allowlist (`tests: []`)	No

How to allow retries for a test

Add the full test name to pkg/test/ginkgo/retry_allowed_tests.yaml:

tests:
  - "[sig-network] Services should be rejected for evicted pods ..."
  - "[sig-node] Pods Extended Pod Container lifecycle ..."

The file is embedded at compile time (go:embed), so changes require rebuilding openshift-tests.

How to regenerate the allowlist from BigQuery

SELECT
  rs.TestName,
  COUNT(*) AS flake_count
FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics` rs
INNER JOIN `openshift-gce-devel.ci_analysis_us.job_variants` jv
  ON rs.JobName = jv.job_name
  AND jv.variant_name = 'Release'
  AND jv.variant_value = '4.22'
WHERE rs.FinalOutcome = 'flaky'
  AND rs.TestName IS NOT NULL AND rs.TestName != ''
  AND rs._PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY rs.TestName
HAVING COUNT(*) >= 5
ORDER BY flake_count DESC

The goal is to reduce this list over time as flaky tests are fixed, eventually reaching an empty list (no retries for anyone).

Summary by CodeRabbit

Tests
- Retry behavior refactored to use an embedded allowlist: only tests present in the allowlist are eligible for a single retry.
- Added tests verifying allowlist-driven retry decisions and confirming the allowlist loader returns a valid result.
- Retry decisions now emit debug logs for allow/deny outcomes and default to no retries if the allowlist cannot be parsed.

coderabbitai · 2026-05-27T12:53:31Z

Walkthrough

The pull request refactors test retry eligibility in pkg/test/ginkgo/retries.go from complex in-function rules to an embedded YAML allowlist. RetryOnceStrategy now stores and uses an in-memory allowlist of retry-eligible test names; shouldRetryTest permits retries only when a test name is in the allowlist. Tests verify both the allowlist loader and the updated retry behavior.

Changes

Test Retry Allowlist Refactor

Layer / File(s)	Summary
Allowlist infrastructure and initialization `pkg/test/ginkgo/retries.go`, `pkg/test/ginkgo/retry_allowed_tests.yaml`	Embeds `retry_allowed_tests.yaml` and defines YAML schema via `loadRetryAllowedTests()`, which unmarshals into a `map[string]bool`. `RetryOnceStrategy` struct gains `AllowedRetryTests` field and initializes it in `NewRetryOnceStrategy()`.
Retry eligibility logic refactor `pkg/test/ginkgo/retries.go`	`shouldRetryTest` now permits retries only when `test.name` is present in `AllowedRetryTests` with debug logging; prior image-tag and binary-presence rules are removed.
Test coverage for allowlist mechanism `pkg/test/ginkgo/retries_test.go`	Table-driven test (`TestRetryOnceStrategy_ShouldRetryTest_Allowlist`) validates retry behavior with a supplied allowlist map. New `TestLoadRetryAllowedTests` verifies the embedded loader returns a non-nil map.

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (14 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All test names in modified files are static, descriptive strings. No Ginkgo test titles with dynamic content (UUIDs, timestamps, generated suffixes) found.
Test Structure And Quality	✅ Passed	The PR adds standard Go unit tests, not Ginkgo Describe/Context/It block tests. Custom check for Ginkgo test code is not applicable.
Microshift Test Compatibility	✅ Passed	PR does not add new Ginkgo e2e tests. Changes are framework code only: refactored retry logic, unit tests using Go testing, and a YAML configuration file.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	No new Ginkgo e2e tests added in this PR. Changes are to test infrastructure (retry strategy) and Go unit tests, not Ginkgo e2e test declarations.
Topology-Aware Scheduling Compatibility	✅ Passed	This PR only modifies test infrastructure in pkg/test/ginkgo/ (test retry eligibility logic). No deployment manifests, operator code, or controllers introducing scheduling constraints are modified.
Ote Binary Stdout Contract	✅ Passed	No OTE Contract violations. YAML parsing via yaml.Unmarshal() doesn't write to stdout, and error logging uses logrus which writes to stderr by default.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No Ginkgo e2e tests added. PR modifies retry logic infrastructure and adds standard Go unit tests (testing.T), not Ginkgo e2e tests. Check not applicable.
No-Weak-Crypto	✅ Passed	No weak cryptographic patterns found. Code uses only standard libraries for YAML parsing and test retry allowlisting; no crypto, hash, or TLS imports present.
Container-Privileges	✅ Passed	PR does not contain K8s container manifests or security configurations. Changes are limited to test retry logic in Go code and a YAML list of eligible test names.
No-Sensitive-Data-In-Logs	✅ Passed	All logging only exposes non-sensitive data: YAML errors, test names, retry metrics, and job metadata. No passwords, tokens, API keys, PII, or credentials are logged.
Title check	✅ Passed	The title 'Restrict test retries to an allowlist of test names' accurately and concisely summarizes the main change: replacing image-tag-based retry eligibility with a YAML allowlist as the sole source of truth for test retry permissions.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-merge-bot · 2026-05-27T12:54:07Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/test/ginkgo/retries_test.go`:
- Around line 101-107: The test TestLoadRetryAllowedJobs is not detecting parse
failures because loadRetryAllowedJobs hides errors; change the loader to expose
parse errors and update the test to assert them: modify loadRetryAllowedJobs (or
add a new helper like loadRetryAllowedJobsWithError) to return (map[string]bool,
error) and have the test call that function, assert error == nil and the
returned map is non-nil/non-empty; update callers accordingly (or keep a thin
wrapper loadRetryAllowedJobs that panics on error for production code) so the
test will fail when the embedded YAML is malformed.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9a11f0a6-4dbc-4d47-8df4-a2e6327e42d7

📥 Commits

Reviewing files that changed from the base of the PR and between 355916d and f7fa00e.

📒 Files selected for processing (3)

pkg/test/ginkgo/retries.go
pkg/test/ginkgo/retries_test.go
pkg/test/ginkgo/retry_allowed_jobs.yaml

mkowalski · 2026-05-27T13:25:07Z

@coderabbitai help

coderabbitai · 2026-05-27T13:25:14Z

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai rate limit to show your current review rate limit status.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
@coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
@coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai generate unit tests to generate unit tests for this PR.
@coderabbitai emit path instructions to emit generated path-specific review instructions for this repository.
@coderabbitai resolve merge conflict to automatically resolve merge conflicts.
@coderabbitai autofix to automatically fix issues identified in unresolved review comments.
@coderabbitai resolve to resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

Visit our status page to check the current availability of CodeRabbit.
Create a ticket on our support page for assistance with any issues or questions.
Visit our documentation site for detailed information on how to use CodeRabbit.
Join our Discord community to connect with other users and get help from the community.
Follow us on X/Twitter for updates and announcements.

mkowalski · 2026-05-27T13:25:39Z

@coderabbitai summary
@coderabbitai full review

coderabbitai · 2026-05-27T13:25:53Z

@mkowalski Sure! I'll regenerate the PR summary and kick off a full review of all the changes.

✅ Actions performed

Summary regeneration triggered.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/test/ginkgo/retries.go`:
- Around line 16-37: The current code implements a test-level allowlist
(retryAllowedTestsYAML, retryAllowedTestsConfig, loadRetryAllowedTests) but the
PR requires a job-level allowlist driven by JOB_NAME; update the implementation
to match by adding a job-level loader (e.g., loadRetryAllowedJobs reading
retry_allowed_jobs.yaml into a struct like Jobs []string) and change the retry
decision logic to consult os.Getenv("JOB_NAME") against that allowlist before
allowing retries, or alternatively update the PR text to state this is
test-level behavior; locate uses of loadRetryAllowedTests and the retry decision
point and replace or supplement them to check JOB_NAME membership in the new
jobs set.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3d187d45-503b-4721-9d24-973c1daa5c5b

📥 Commits

Reviewing files that changed from the base of the PR and between f7fa00e and 0412516.

📒 Files selected for processing (3)

pkg/test/ginkgo/retries.go
pkg/test/ginkgo/retries_test.go
pkg/test/ginkgo/retry_allowed_tests.yaml

mkowalski · 2026-05-27T14:18:18Z

/cc @dgoodwin

dgoodwin · 2026-05-27T14:55:51Z

The query claude came up with is better than I would have, I did not know that table exists, and it's perfect for our purposes.

I suggest this query:

  SELECT
    rs.TestName,
    COUNT(*) AS flake_count,
    COUNT(DISTINCT rs.JobName) AS distinct_jobs_affected,
    SUM(rs.TotalAttempts) AS total_attempts,
    SUM(rs.FailedAttempts) AS total_failed_attempts
  FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics` rs
  INNER JOIN `openshift-gce-devel.ci_analysis_us.job_variants` jv
    ON rs.JobName = jv.job_name
    AND jv.variant_name = 'Release'
    AND jv.variant_value = '4.22'
  WHERE rs.FinalOutcome = 'flaky'
    AND rs.TestName IS NOT NULL AND rs.TestName != ''
    AND rs.PartitionTime >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  GROUP BY rs.TestName
  HAVING COUNT(*) >= 5
  ORDER BY flake_count DESC

We're not concerned with flakes in old releases, and filtering to the past 30 days seems logical. 4.22 has the best data over the past month as it's the most stable.

If a test flaked only once in the last month, I'm fine removing it's right to flake. We may have to add more to this list over time, things could surface in CR, but hopefully that will be rare. 5 seems like a good threshold, and that should be about 207 tests.

Threshold options from claude:

┌───────────┬────────────┬────────────┬──────────────────────┐
│ Threshold │ Tests kept │ % of tests │ Flake events covered │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 2 │ 395 │ 81% │ 99.0% │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 3 │ 316 │ 65% │ 97.3% │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 5 │ 207 │ 42.5% │ 93.2% │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 10 │ 113 │ 23.2% │ 86.6% │
└───────────┴────────────┴────────────┴──────────────────────┘

We could go lower though, 2-3 would also be reasonable IMO.

dgoodwin · 2026-05-27T15:03:50Z


 func NewRetryOnceStrategy() *RetryOnceStrategy {
 	return &RetryOnceStrategy{
-		PermittedRetryImageTags: []string{"tests"}, // tests = openshift-tests image


@stbenjam this looks safe to me, but would appreciate if you can confirm. Mat is closing the retry gap at last, with an explicit list of tests that need retries only allowed.

dgoodwin · 2026-05-27T15:06:18Z

-		"[sig-node] Pods Extended Pod Container lifecycle evicted pods should be terminal [Suite:openshift/conformance/parallel] [Suite:k8s]",                                                                                                                                           // https://issues.redhat.com/browse/OCPBUGS-57658
-		"[sig-cli] Kubectl logs all pod logs the Deployment has 2 replicas and each pod has 2 containers should get logs from each pod and each container in Deployment [Suite:openshift/conformance/parallel] [Suite:k8s]",                                                             // https://issues.redhat.com/browse/OCPBUGS-61287
-		"[sig-cli] Kubectl Port forwarding Shutdown client connection while the remote stream is writing data to the port-forward connection port-forward should keep working after detect broken connection [Suite:openshift/conformance/parallel] [Suite:k8s]",                        // https://issues.redhat.com/browse/OCPBUGS-61734
-		"[sig-storage] OCP CSI Volumes [Driver: csi-hostpath-groupsnapshot] [OCPFeatureGate:VolumeGroupSnapshot] [Testpattern:  (delete policy)] volumegroupsnapshottable [Feature:volumegroupsnapshot] VolumeGroupSnapshottable should create snapshots for multiple volumes in a pod", // https://issues.redhat.com/browse/OCPBUGS-66967


Please check if all these are in the list with the new query I added in the PR, if not we should add them.

Also could you ensure the list is sorted, the query I provided isn't. :)

The list is now sorted. The "migrated" whitelisted tests are there. After running your revised query we are down to ~200 tests only.

dgoodwin · 2026-05-27T15:07:05Z

Code looks good, just a revision to the query and some sorting concerns.

Replace the image-tag-based and hardcoded retry eligibility logic in shouldRetryTest() with a single YAML allowlist (retry_allowed_tests.yaml) that is the sole source of truth for which tests may retry on failure. Previously, any test from a permitted image tag could retry, and 6 specific tests had hardcoded exceptions. Now, a test must be explicitly listed in the YAML file to get a retry. Tests not on the list must pass on the first attempt. The allowlist is populated from BigQuery retry_statistics data: all 1395 tests that historically failed first but passed on retry are included. The 6 previously hardcoded exception tests (OCPBUGS-57477, OCPBUGS-57665, OCPBUGS-57658, OCPBUGS-61287, OCPBUGS-61734, OCPBUGS-66967) are covered by the BigQuery data and no longer need special treatment in Go code. The list should be reduced over time as flaky tests are fixed. To disable retries entirely, empty the YAML list.

coderabbitai

🧹 Nitpick comments (1)

pkg/test/ginkgo/retry_allowed_tests.yaml (1)

8-23: ⚡ Quick win

Consider enriching the documented query with additional metrics.

The BigQuery query documents how to regenerate the allowlist, but it only returns TestName and flake_count. Based on the PR comments, a more informative query would also include distinct_jobs_affected, total_attempts, and total_failed_attempts to help maintainers make better decisions when regenerating the list.

📊 Suggested enhanced query

SELECT
  rs.TestName,
  COUNT(*) AS flake_count,
  COUNT(DISTINCT rs.JobName) AS distinct_jobs_affected,
  SUM(rs.TotalAttempts) AS total_attempts,
  SUM(rs.FailedAttempts) AS total_failed_attempts
FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics` rs
INNER JOIN `openshift-gce-devel.ci_analysis_us.job_variants` jv
  ON rs.JobName = jv.job_name
  AND jv.variant_name = 'Release'
  AND jv.variant_value = '4.22'
WHERE rs.FinalOutcome = 'flaky'
  AND rs.TestName IS NOT NULL AND rs.TestName != ''
  AND rs._PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY rs.TestName
HAVING COUNT(*) >= 5
ORDER BY flake_count DESC

This provides better visibility into the scope and impact of each flaky test.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/test/ginkgo/retry_allowed_tests.yaml` around lines 8 - 23, Update the
documented BigQuery in pkg/test/ginkgo/retry_allowed_tests.yaml (the query that
selects from retry_statistics aliased as rs and joins job_variants as jv) to
include the additional metrics: COUNT(DISTINCT rs.JobName) AS
distinct_jobs_affected, SUM(rs.TotalAttempts) AS total_attempts, and
SUM(rs.FailedAttempts) AS total_failed_attempts; keep the existing WHERE, JOIN,
GROUP BY rs.TestName, HAVING COUNT(*) >= 5 and ordering by flake_count so the
query returns TestName, flake_count plus the three new aggregate columns to
provide better context for maintainers.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/test/ginkgo/retry_allowed_tests.yaml`:
- Around line 8-23: Update the documented BigQuery in
pkg/test/ginkgo/retry_allowed_tests.yaml (the query that selects from
retry_statistics aliased as rs and joins job_variants as jv) to include the
additional metrics: COUNT(DISTINCT rs.JobName) AS distinct_jobs_affected,
SUM(rs.TotalAttempts) AS total_attempts, and SUM(rs.FailedAttempts) AS
total_failed_attempts; keep the existing WHERE, JOIN, GROUP BY rs.TestName,
HAVING COUNT(*) >= 5 and ordering by flake_count so the query returns TestName,
flake_count plus the three new aggregate columns to provide better context for
maintainers.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 89a30401-3429-4f1c-8564-5069943c9079

📥 Commits

Reviewing files that changed from the base of the PR and between 0412516 and 8b792da.

📒 Files selected for processing (3)

pkg/test/ginkgo/retries.go
pkg/test/ginkgo/retries_test.go
pkg/test/ginkgo/retry_allowed_tests.yaml

🚧 Files skipped from review as they are similar to previous changes (2)

pkg/test/ginkgo/retries_test.go
pkg/test/ginkgo/retries.go

openshift-merge-bot · 2026-05-28T09:50:29Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

dgoodwin · 2026-05-28T11:23:25Z

/lgtm

Nice when one of our strategic initiatives is closed out in a couple hours and one PR :) Thanks!

openshift-ci · 2026-05-28T11:24:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, mkowalski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [dgoodwin,mkowalski]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2026-05-28T13:35:55Z

@mkowalski: This pull request explicitly references no jira issue.

Details

In response to this:

Summary

Replace the image-tag-based and hardcoded retry eligibility in shouldRetryTest() with a single YAML allowlist (retry_allowed_tests.yaml) that is the sole source of truth for which tests may retry on failure.

Previously, any test from a permitted image tag could retry, and 6 specific tests had hardcoded exceptions. Now, a test must be explicitly listed in the YAML to get a retry.

The allowlist is populated from BigQuery retry_statistics data, scoped to release 4.22, last 30 days, with a threshold of >= 5 flake occurrences. This yields 206 tests.

Behavior

Scenario Retries?

Test listed in the YAML allowlist Yes

Test NOT in the allowlist No

Empty allowlist (tests: []) No

How to allow retries for a test

Add the full test name to pkg/test/ginkgo/retry_allowed_tests.yaml:
tests:
 - "[sig-network] Services should be rejected for evicted pods ..."
 - "[sig-node] Pods Extended Pod Container lifecycle ..."
The file is embedded at compile time (go:embed), so changes require rebuilding openshift-tests.

How to regenerate the allowlist from BigQuery
SELECT
 rs.TestName,
 COUNT(*) AS flake_count
FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics` rs
INNER JOIN `openshift-gce-devel.ci_analysis_us.job_variants` jv
 ON rs.JobName = jv.job_name
 AND jv.variant_name = 'Release'
 AND jv.variant_value = '4.22'
WHERE rs.FinalOutcome = 'flaky'
 AND rs.TestName IS NOT NULL AND rs.TestName != ''
 AND rs._PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY rs.TestName
HAVING COUNT(*) >= 5
ORDER BY flake_count DESC
The goal is to reduce this list over time as flaky tests are fixed, eventually reaching an empty list (no retries for anyone).

Summary by CodeRabbit

Tests

Retry behavior refactored to use an embedded allowlist: only tests present in the allowlist are eligible for a single retry.

Added tests verifying allowlist-driven retry decisions and confirming the allowlist loader returns a valid result.

Retry decisions now emit debug logs for allow/deny outcomes and default to no retries if the allowlist cannot be parsed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

mkowalski · 2026-05-29T08:33:07Z

/test e2e-metal-ipi-ovn-ipv6

mkowalski · 2026-05-29T08:33:19Z

/verified by later

openshift-ci-robot · 2026-05-29T08:33:32Z

@mkowalski: This PR has been marked as verified by later.

Details

In response to this:

/verified by later

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-05-29T12:30:26Z

@mkowalski: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

stbenjam · 2026-05-29T12:35:04Z

+	var config retryAllowedTestsConfig
+	if err := yaml.Unmarshal(retryAllowedTestsYAML, &config); err != nil {
+		logrus.WithError(err).Error("Failed to parse retry_allowed_tests.yaml, no tests will be allowed to retry")
+		return map[string]bool{}


This should've used the kube set type, FWIW

origin/pkg/clioptions/clusterdiscovery/cluster.go

Line 90 in a29f970

APIGroups sets.Set[string] `json:"-"`

stbenjam · 2026-05-29T12:40:02Z

Sorry I only got around to looking at this now, logic looks correct to me 👍🏻

openshift-ci Bot requested review from deads2k and sjenning May 27, 2026 12:51

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2026

coderabbitai Bot requested changes May 27, 2026

View reviewed changes

Comment thread pkg/test/ginkgo/retries_test.go Outdated

mkowalski force-pushed the retry-allowlist branch from 9d45683 to 0412516 Compare May 27, 2026 13:21

mkowalski changed the title ~~Restrict test retries to an allowlist of CI jobs~~ Restrict test retries to an allowlist of test names May 27, 2026

coderabbitai Bot requested changes May 27, 2026

View reviewed changes

Comment thread pkg/test/ginkgo/retries.go

coderabbitai Bot approved these changes May 27, 2026

View reviewed changes

openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label May 27, 2026

openshift-ci Bot requested a review from dgoodwin May 27, 2026 14:18

dgoodwin reviewed May 27, 2026

View reviewed changes

mkowalski force-pushed the retry-allowlist branch from 0412516 to 8b792da Compare May 28, 2026 08:59

mkowalski force-pushed the retry-allowlist branch from 8b792da to cf2226e Compare May 28, 2026 09:04

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

openshift-ci Bot assigned dgoodwin May 28, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2026

mkowalski changed the title ~~Restrict test retries to an allowlist of test names~~ NO-ISSUE: Restrict test retries to an allowlist of test names May 28, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 28, 2026

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 29, 2026

openshift-merge-bot Bot merged commit 732285e into openshift:main May 29, 2026
21 checks passed

stbenjam reviewed May 29, 2026

View reviewed changes

Uh oh!

Conversation

mkowalski commented May 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Behavior

How to allow retries for a test

How to regenerate the allowlist from BigQuery

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

openshift-merge-bot Bot commented May 27, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mkowalski commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026

Chat

CodeRabbit commands

Other keywords and placeholders

Status, support, documentation and community

Uh oh!

mkowalski commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mkowalski commented May 27, 2026

Uh oh!

dgoodwin commented May 27, 2026

Uh oh!

dgoodwin May 27, 2026

Choose a reason for hiding this comment

Uh oh!

dgoodwin May 27, 2026

Choose a reason for hiding this comment

Uh oh!

mkowalski May 28, 2026

Choose a reason for hiding this comment

Uh oh!

dgoodwin commented May 27, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-merge-bot Bot commented May 28, 2026

Uh oh!

dgoodwin commented May 28, 2026

Uh oh!

openshift-ci Bot commented May 28, 2026

Uh oh!

openshift-ci-robot commented May 28, 2026

Summary

Behavior

How to allow retries for a test

How to regenerate the allowlist from BigQuery

Summary by CodeRabbit

Uh oh!

mkowalski commented May 29, 2026

Uh oh!

mkowalski commented May 29, 2026

Uh oh!

openshift-ci-robot commented May 29, 2026

Uh oh!

openshift-ci Bot commented May 29, 2026

Uh oh!

Uh oh!

stbenjam May 29, 2026

Choose a reason for hiding this comment

mkowalski commented May 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading